Wikification for Scriptio Continua
Abstract
Japanese is written in scriptio continua, a writing system without spaces between words, which complicates the first step of an NLP pipeline. Word segmentation is widely used in Japanese language processing, and lexical knowledge is crucial for reliably identifying words in text. Although external lexical resources such as Wikipedia are potentially useful, segmentation mismatch prevents them from being incorporated straightforwardly into the word segmentation task. If we intentionally violate segmentation standards by incorporating such resources directly, quantitative evaluation is no longer feasible. To address this problem, we propose defining a separate task that directly links given texts to an external resource, that is, wikification in the case of Wikipedia. Doing so lets us circumvent segmentation mismatches that may not be important for downstream applications. As a first step toward realizing this idea, we design the task of Japanese wikification and construct wikification corpora. We annotated subsets of the Balanced Corpus of Contemporary Written Japanese as well as Twitter short messages. We also implement a simple wikifier and investigate its performance on these corpora.
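The abstract does not say how the simple wikifier works. As an illustration of linking unsegmented text directly to an external resource, the following is a minimal sketch of one possible baseline: greedy longest-match lookup over raw characters, with no word segmentation step. It is not the authors' implementation. The function name `wikify_longest_match`, the `max_len` parameter, and the toy dictionary `TOY_TITLES` are all hypothetical and exist only for this example.

```python
from typing import Dict, List, Tuple


def wikify_longest_match(
    text: str, titles: Dict[str, str], max_len: int = 10
) -> List[Tuple[int, int, str]]:
    """Link character spans of unsegmented text to page titles.

    Scans left to right and, at each position, takes the longest span
    (up to max_len characters) found in the dictionary. Returns a list of
    (start, end, page_title) tuples with end exclusive.
    """
    links: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate span first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            surface = text[i:j]
            if surface in titles:
                match = (i, j, titles[surface])
                break
        if match is not None:
            links.append(match)
            i = match[1]  # continue after the linked span
        else:
            i += 1  # no entry starts here; move one character on
    return links


# Hypothetical toy dictionary: surface form -> page title (assumed data).
TOY_TITLES = {"東京": "東京都", "東京大学": "東京大学", "大学": "大学"}

links = wikify_longest_match("東京大学に行く", TOY_TITLES)
print(links)
```

Because spans are defined over character offsets rather than word tokens, the output does not depend on any segmentation standard, which is the property the abstract argues for. A practical wikifier would also need disambiguation among candidate pages, which this sketch omits.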
Similar resources
Enhancing Query-oriented Summarization based on Sentence Wikification
Query-oriented summarization is primarily concerned with synthesizing an informative and well-organized summary from a document collection for a given query. In the existing summarization methods, each individual sentence in the document collection is represented as Bag of Words (BOW). In this paper, we propose a novel framework which improves query-oriented summarization via sentence wikificat...
Relational Inference for Wikification
Wikification, commonly referred to as Disambiguation to Wikipedia (D2W), is the task of identifying concepts and entities in text and disambiguating them into the most specific corresponding Wikipedia pages. Previous approaches to D2W focused on the use of local and global statistics over the given text, Wikipedia articles and its link structures, to evaluate context compatibility among a list ...
Towards Improving Dialogue Topic Tracking Performances with Wikification of Concept Mentions
Dialogue topic tracking aims at analyzing and maintaining topic transitions in on-going dialogues. This paper proposes to utilize Wikification-based features for providing mention-level correspondences to Wikipedia concepts for dialogue topic tracking. The experimental results show that our proposed features can significantly improve the performances of the task in mixed-initiative human-human ...
WikifyMe: Creating Testbed for Wikifiers
Finding relationships between words in text and articles from Wikipedia is an extremely popular task known as wikification. However there is still no gold standard corpus for wikifiers comparison. We present WikifyMe, the online tool for collaborative work on universal test collection which allows users to easily prepare tests for two most difficult problems in wikification: word-sense disambig...
Child readers’ eye movements in reading Thai
It has recently been found that adult native readers of Thai, an alphabetic scriptio continua language, engage similar oculomotor patterns as readers of languages written with spaces between words; despite the lack of inter-word spaces, first and last characters of a word appear to guide optimal placement of Thai readers' eye movements, just to the left of word-centre. The issue addressed by th...